Abstract



This paper presents a new paradigm for signal reconstruction and superresolution, 
Correlation Kernel Analysis (CKA), that is based on the selection of a sparse set of 
bases from a large dictionary of class-specific basis functions. The basis functions that 
we use are the correlation functions of the class of signals we are analyzing. To choose 
the appropriate features from this large dictionary, we use Support Vector Machine 
(SVM) regression and compare this to traditional Principal Component Analysis 
(PCA) for the tasks of signal reconstruction, superresolution, and compression. The
testbed we use in this paper is a set of images of pedestrians. This paper also presents 
results of experiments in which we use a dictionary of multiscale basis functions and 
then use Basis Pursuit De-Noising to obtain a sparse, multiscale approximation of a 
signal. The results are analyzed and we conclude that 1) when used with a sparse 
representation technique, the correlation function is an effective kernel for image
reconstruction and superresolution, 2) for image compression, PCA and SVM have
different tradeoffs, depending on the particular metric that is used to evaluate the
results, 3) in sparse representation techniques, $L_1$ is not a good proxy for the true
measure of sparsity, $L_0$, and 4) the $L_\epsilon$ norm may be a better error metric for image
reconstruction and compression than the $L_2$ norm, though the exact psychophysical
metric should take into account high order structure in images.

1 Introduction

This paper presents Correlation Kernel Analysis (CKA), a new paradigm for signal reconstruction 
and compression that is based on the selection of a sparse set of bases from a large dictionary 
of class-specific basis functions. The concept of sparsity enforces the requirement that, given a 
certain reconstruction error, we should choose the smallest subset of basis functions that yields a 
reconstruction with this error. The problem of signal reconstruction is formulated as one where 
we are given only a small, possibly unevenly sampled, subset of points in a signal where the goal 
is to accurately reconstruct the entire signal. We also investigate a closely related subject, lossy 
compression, that is, given an entire signal of N bits, we see how well we can represent the signal 
with only $M \ll N$ bits of information, using the same general technique.

The signal approximation problem we present assumes that we have prior information about 
the class of signals we are reconstructing or compressing; this information is in the form of 
the correlation function of the class of signals to which this signal belongs, as defined by a 
representative set of signals from this class (Penev and Atick, 1996; Poggio and Girosi, 1998a; 
Poggio and Girosi, 1998b). For this paper, the signals that we will be looking at are images of 
pedestrians (Papageorgiou, 1997; Oren, et al., 1997; Papageorgiou, et al., 1998). Using an initial 
set of pedestrian images, we compute the correlation function and use the pointwise-defined
functions as the dictionary of basis functions from which we can reconstruct subsequent out- 
of-sample images of pedestrians. Our choice of using the correlation kernel can be motivated 
from a Bayesian point of view. We show that, if we assume a gaussian noise process on our 
measurements, the kernel to use, in a Bayesian sense, is the correlation kernel. 
To approximate or reconstruct an image, rather than using the entire set of correlation-based basis 
functions comprising the dictionary - this would result in no compression whatsoever - we choose 
a small subset of the kernels via the criteria of sparsity. We obtain a sparse representation by 
approximating the signal using the Support Vector Machine (SVM) (Boser, Guyon, and Vapnik, 
1992; Vapnik, 1995) formulation of the regression problem. Based on recently reported results 
(Girosi, 1997; Girosi, 1998), we note that this framework is equivalent to using a modified version 
of the Basis Pursuit De-Noising (BPDN) approach of Chen, Donoho, and Saunders (1995) to 
obtaining a sparse representation of a signal. 

We push this paradigm further by investigating the use of dictionaries of multiscale basis functions 
that encode different levels of detail. To obtain a sparse, multiscale approximation of a signal, 
we use BPDN; this leads to improved reconstruction error and a more sparse representation. We 
also show that the empirical results highlight a drawback in using traditional formulations of 
sparsity. 

The results presented in this paper can be useful in low-bandwidth videoconferencing, image 
de-noising, reconstruction in the presence of occlusions, signal approximation from sparse data, 
as well as in superresolving images. It is important to note that the results are not particular to 
image analysis; this technique can also be seen as an alternative to traditional means of function 
approximation and signal reconstruction, such as Principal Components Analysis (PCA), for a
wider class of signals. 

The paper is organized as follows: in Section 2, we introduce generalized correlation kernels and 
Section 3 provides Bayesian motivation for our choice of kernels. Section 4 describes the concept 
of sparsity and presents both the SVM regression and BPDN formulations of this approach. 
In Section 5, we present results of several image reconstruction experiments using CKA for
sparse approximations with the generalized correlation kernels and describe a superresolution
reconstruction experiment. Section 6 presents results of image compression experiments and a 
comparison between SVM and BPDN on this task. In Section 7, we show results of experiments 
that use a dictionary with basis functions at multiple scales to do lossy image compression 
using BPDN. Section 8 discusses the error norms that our different reconstruction techniques 
use and their psychophysical plausibility. Section 9 summarizes our results and presents several 
observations and open questions. 

2 Generalized Correlation Kernels 

To reconstruct or compress a function $f$, we use information about the class of pointwise
mean-normalized signals that $f$ is a part of, derived from a set of representative examples from
that class. This information is in the form of the correlation function of the signals in the class:

$$R(x, y) = E[(f_\alpha(x) - \mu(x))(f_\alpha(y) - \mu(y))] \qquad (1)$$

where $f_\alpha$ are instances of the class of functions to which $f$ belongs, $x$ and $y$ are
coordinates in the 2-dimensional signal, and $\mu$ are the point means across the class of
functions: $\mu(x) = E[f_\alpha(x)]$. We can also generate the eigen-decomposition of the symmetric,
positive definite correlation matrix by solving

$$\int dx \, R(x, y) \, \phi_n(x) = \lambda_n \phi_n(y) \qquad (2)$$

where $\phi_n$ are the eigenvectors and $\lambda_n$ are the eigenvalues of the system. After
generating this decomposition, we can write $R$ in the form

$$R(x, y) = \sum_{n=1}^{M} \lambda_n \phi_n(x) \phi_n(y) \qquad (3)$$

where $M < \infty$; this result is due to the spectral theorem.
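As a minimal sketch of Equations 1 through 3 on a discrete pixel grid, the correlation function becomes a matrix estimated from example signals, and its eigen-decomposition rebuilds it exactly. The function names and the random toy data below are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

def correlation_kernel(signals):
    """signals: array of shape (num_examples, num_pixels).
    Returns R[x, y] = E[(f(x) - mu(x)) (f(y) - mu(y))], as in Equation 1."""
    signals = np.asarray(signals, dtype=float)
    centered = signals - signals.mean(axis=0)   # subtract the pointwise mean mu(x)
    return centered.T @ centered / signals.shape[0]

def eigen_decomposition(R):
    """Eigenvalues in decreasing order and matching orthonormal eigenvectors (columns)."""
    eigvals, eigvecs = np.linalg.eigh(R)        # eigh is appropriate because R is symmetric
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
toy_signals = rng.normal(size=(200, 12))        # stand-in for flattened images (assumption)
R = correlation_kernel(toy_signals)
lam, phi = eigen_decomposition(R)
R_rebuilt = (phi * lam) @ phi.T                 # Equation 3: sum_n lam_n phi_n phi_n^T
```

Here `R_rebuilt` agrees with `R` up to floating-point error, which is the discrete statement of Equation 3.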

The set of functions $\phi_n$ are ordered with decreasing positive eigenvalue $\lambda_n$ and are
normalized to form an orthonormal basis for the correlation function of $f_\alpha$. The classical
Principal Component Analysis (PCA) approach approximates a function $f$ as a linear combination
of a finite number, $M'$, of the basis functions $\phi_n$:

$$f(x) = \sum_{n=1}^{M'} b_n \phi_n(x) \qquad (4)$$

where the coefficients $b_n$ are determined so as to minimize the $L_2$ approximation error of $f$.
Poggio and Girosi (1998a) show that the correlation function $R$, which is positive definite, induces
a Reproducing Kernel Hilbert Space (RKHS) that allows us to approximate the function $f$ as:

$$f(x) = \sum_{i=1}^{N} c_i R(x, x_i) \qquad (5)$$

where $i$ ranges over pixel locations in the image; $R$ is the reproducing kernel in this space and
the norm is:

$$\|f\|_R^2 = \sum_{n=1}^{M} \frac{\langle f, \phi_n \rangle^2}{\lambda_n} \qquad (6)$$




Figure 1: Examples of the correlation kernels we can compute. The kernels shown here are
computed from a set of 924 grey-level 128 × 64 images of pedestrians that have been normalized to
the same scale and position in the image. Each column shows the kernels, $R_d((x_1 = a, x_2 = b), y)$,
for a specific $(a, b)$ where $d = 0.0$, $d = 0.5$, and $d = 1.0$ in the top, middle, and bottom rows,
respectively. These images demonstrate that $d = 1.0$ corresponds to a very smooth kernel, while
$d = 0.0$ is highly localized.

We can obtain a wider class of kernels spanning exactly the same space of functions as the
correlation function in Equation 3 by varying the degree of $\lambda_n$, which in effect controls the
prior information regarding the strength of each eigenfunction, an observation due to Penev and
Atick (1996). We therefore define the generalized correlation kernel as:

$$R_d(x, y) = \sum_{n=1}^{M} \lambda_n^d \phi_n(x) \phi_n(y) \qquad (7)$$

and notice that the parameter $d$ controls the locality of the kernel; for small $d$, $R_d$ approaches a
delta function in the space of $\phi_n$, and as $d$ gets larger, $R_d$ gets smoother (this particular
parameterization is one of many possibilities).
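The effect of the exponent $d$ on the generalized correlation kernel can be checked numerically. The sketch below assumes a full-rank toy correlation matrix (so that $M$ equals the number of pixels); the matrix `A` and the function name are hypothetical and chosen only for illustration.

```python
import numpy as np

def generalized_kernel(eigvals, eigvecs, d):
    """R_d = sum_n eigvals[n]**d * phi_n phi_n^T, with phi_n the columns of eigvecs."""
    eigvals = np.asarray(eigvals, dtype=float)
    return (eigvecs * eigvals ** d) @ eigvecs.T

# Toy symmetric positive definite "correlation" matrix (illustrative only).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
lam, phi = np.linalg.eigh(A)
R0 = generalized_kernel(lam, phi, 0.0)   # identity matrix: a delta function on the grid
R1 = generalized_kernel(lam, phi, 1.0)   # recovers the correlation matrix itself
```

With $d = 0$ every eigenfunction is weighted equally and the kernel collapses to the identity, while $d = 1$ returns the original correlation matrix; intermediate values interpolate between the two.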



Each of these correlation kernels is a function in four variables $(x_1, x_2, y_1, y_2)$ so, to
effectively visualize them, we hold the $x_1$ and $x_2$ positions constant and vary $y_1$ and $y_2$.
Figure 1 shows several examples of the kernels generated with varying $d$, for a set of 924
grey-level 128 × 64 images of pedestrians that have been normalized to the same scale and position;
this database has been used in Papageorgiou (1997), Oren, et al. (1997), and Papageorgiou, et al.
(1998). Each column shows $R_d((x_1 = a, x_2 = b), y)$ for an image where, from the top to bottom
rows, $d = 0.0$, $d = 0.5$, and $d = 1.0$; for example, the first column shows the kernels for
$R_d((11, 10), y)$. The progressive delocalization of the kernels when $d$ is varied from 0.0 to 1.0
is evident in these figures.

3 Bayesian Motivation 

Our choice of the correlation function, $R$, as the kernel can be motivated from a Bayesian
perspective; see Wahba (1990) and Poggio and Girosi (1998a) for background material. Consider
the general regularization problem:

$$\min_{f \in \mathcal{H}} H[f] = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \gamma \|f\|_R^2 \qquad (8)$$

In a Bayesian interpretation, the data term is a model of the noise and the stabilizer is a prior on
the regression function $f$. If we assume that the data, $y_i$, are affected by additive independent
gaussian noise, then the likelihood has the following form:

$$P(y \mid f) \propto e^{-\sum_{i=1}^{N} (y_i - f(x_i))^2} \qquad (9)$$

and, when we use the correlation kernel $R$, the prior probability is:

$$P(f) \propto e^{-\|f\|_R^2} \propto e^{-\sum_{n=1}^{M} \frac{c_n^2}{\lambda_n}} \qquad (10)$$

where $M < \infty$. As shown earlier, this corresponds to a representation of the form:

$$f(x) = \sum_{n=1}^{M} c_n \phi_n(x) \qquad (11)$$

Thus, the stabilizer measures the Mahalanobis distance of $f$ from the mean signal. This also
corresponds to a zero mean multivariate gaussian density on the Hilbert space of functions defined
by $R$ and spanned by $\phi_n$, i.e., the space spanned by the principal components introduced in
Section 2. From a Bayesian point of view, under the assumption of gaussian noise, $R$ is the
right kernel to use, whenever it is available. It is important to note that in our SVM and BPDN
formulations, we use gaussian priors but do not assume gaussian additive noise in the data.
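
The link between the likelihood, the prior, and the regularization functional of Equation 8 can be written out in one step. The following is a sketch under the assumption that all constants are absorbed into $\gamma$; the noise variance $\sigma^2$ used at the end is a symbol introduced here for illustration and does not appear in the text.

$$P(f \mid y) \propto P(y \mid f) \, P(f) \propto \exp\left( -\sum_{i=1}^{N} (y_i - f(x_i))^2 - \|f\|_R^2 \right)$$

$$-\log P(f \mid y) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \|f\|_R^2 + \text{const}$$

Maximizing the posterior is therefore equivalent to minimizing a functional of the form of Equation 8. If the likelihood is written with an explicit noise variance, $e^{-\frac{1}{2\sigma^2} \sum_i (y_i - f(x_i))^2}$, then multiplying the negative log posterior by $2\sigma^2$ gives Equation 8 with $\gamma = 2\sigma^2$, so noisier data calls for a heavier weight on the prior.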

4 Sparsity 

The operational definition of a sparse representation in the context of regression that we will use is
the smallest subset of elements from a large dictionary of features such that a linear superposition
of these features can effectively reconstruct the original signal. In this paper, we will focus on
sparse representations using the correlation kernels introduced in Section 2:

$$f(x) = \sum_{i=1}^{N'} c_i R(x, x_i) \qquad (12)$$

where $N'$ is smaller than the size of the signal.

Suppose that we have a large dictionary of core building blocks for a class of signals we are
analyzing. Given a new signal of the same class, obtaining a sparse representation of this signal
amounts to choosing the smallest subset of building blocks from the dictionary that will allow us
to achieve a certain level of performance. It is important to note that comparing representations
for sparsity is only fair for a given performance criterion.

Here, we present a brief introduction to the concepts of Support Vector Machine regression and
Basis Pursuit De-Noising as they apply to sparse representations; for a more in-depth treatment
of these subjects, the reader is referred to (Boser, Guyon, and Vapnik, 1992; Vapnik, 1995;
Burges, 1998; Chen, Donoho, and Saunders, 1995; Girosi, 1997; Girosi, 1998).

4.1 Support Vector Machine Regression 

Given a kernel $K$ that defines an RKHS and with the appropriate choice of the scalar product
induced by $K$, the empirical risk minimization regularization theory framework suggests
minimizing the following functional:

$$H[f] = \frac{1}{N} \sum_{i=1}^{N} \|y_i - f(x_i)\|_{L_2}^2 + \gamma \|f\|_K^2 \qquad (13)$$

where $\|f\|_K$ is as defined in Section 2. This corresponds to minimizing the sum of the empirical
error measured in $L_2$ and a smoothness functional. The Support Vector Machine regression
formulation minimizes a similar functional, differing only in the norm on the data term; instead
of using the $L_2$ norm, the following $\epsilon$-insensitive error function, called the $L_\epsilon$
norm, is used:

$$|y_i - f(x_i)|_\epsilon = \begin{cases} 0 & \text{if } |y_i - f(x_i)| < \epsilon \\ |y_i - f(x_i)| - \epsilon & \text{otherwise} \end{cases} \qquad (14)$$

The functional that is minimized is therefore:

$$H[f] = \frac{1}{N} \sum_{i=1}^{N} |y_i - f(x_i)|_\epsilon + \gamma \|f\|_K^2 \qquad (15)$$

yielding a function of the form:

$$f(x) = \sum_{i=1}^{N'} c_i R(x, x_i) \qquad (16)$$

where the coefficients $c$ are obtained by solving a quadratic programming problem (Vapnik,
1995; Osuna, Freund, and Girosi, 1997; Girosi, 1997). Depending on the value of the sparsity
parameter $\gamma$, the number of $c_i$ that differ from zero will be smaller than $N$; the data points
associated with the non-zero coefficients are called support vectors and it is these support vectors
that comprise our sparse approximation.
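
The $\epsilon$-insensitive error of Equation 14 is simple to state in code. The sketch below evaluates it on a handful of made-up pixel values and sums the result, which corresponds to the data term of Equation 15 up to its normalization; the function names and numbers are illustrative assumptions.

```python
def eps_insensitive(residual, eps):
    """|r|_eps from Equation 14: zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(residual) - eps)

def svm_data_term(targets, predictions, eps):
    """Summed eps-insensitive error over all data points."""
    return sum(eps_insensitive(y - p, eps) for y, p in zip(targets, predictions))

targets = [100.0, 120.0, 90.0, 60.0]        # hypothetical grey levels
predictions = [104.0, 95.0, 90.0, 72.0]     # hypothetical reconstruction
print(svm_data_term(targets, predictions, eps=10.0))   # 17.0
```

Only the second and fourth points fall outside the tube (residuals of 25 and 12), contributing 15 and 2; points fitted to within $\epsilon$ cost nothing, which is what allows many coefficients to vanish.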



4.2 Basis Pursuit De-Noising 

The Basis Pursuit De-Noising approach of Chen, Donoho, and Saunders (1995) is a means of
decomposing a signal into a small number of constituent dictionary elements. The functional
that is minimized consists of an error term and a sparsity term and, in the case of arbitrary basis
functions $\phi_i$, is:

$$E[c] = \left\| f(x) - \sum_{i=1}^{N} c_i \phi_i(x) \right\|_{L_2}^2 + \lambda \|c\|_{L_1} \qquad (17)$$

In our case, to sparsify Equation 12, the following functional must be minimized (Girosi, 1997;
Girosi, 1998):

$$E[c] = \left\| f(x) - \sum_{i=1}^{N} c_i R(x, x_i) \right\|_{L_2}^2 + \lambda \|c\|_{L_1} \qquad (18)$$

yielding an approximation to $f$ that has a similar form to Equation 16. Girosi (1997) shows that
if, instead of the $L_2$ norm, we use the norm induced by $R$, then Basis Pursuit De-Noising is in
fact equivalent to Support Vector Machine regression and identical sparse representations are
obtained.

This function minimization is formulated as a quadratic programming problem (see Appendix A)
and can be solved using traditional methods. Appendix B presents a decomposition algorithm
that allows us to quickly solve this minimization problem even when we have a large dictionary
of basis functions.
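
For readers who want to experiment with a functional of this form, a small solver is sketched below. It uses iterative soft thresholding (ISTA), a standard proximal-gradient method, and is not the quadratic programming formulation or the decomposition algorithm of the appendices. The dictionary `Phi`, the signal `f`, and the value of `lam` are toy assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrinks each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bpdn_ista(Phi, f, lam, n_iter=500):
    """Minimize ||f - Phi @ c||_2^2 + lam * ||c||_1 by iterative soft thresholding."""
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    c = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ c - f)             # gradient of the squared error term
        c = soft_threshold(c - step * grad, step * lam)
    return c

def objective(Phi, f, c, lam):
    """Value of the BPDN functional for coefficients c."""
    return np.sum((f - Phi @ c) ** 2) + lam * np.sum(np.abs(c))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 60))          # overcomplete toy dictionary (columns = basis functions)
c_true = np.zeros(60)
c_true[[3, 17, 42]] = [2.0, -1.5, 1.0]   # signal built from three dictionary elements
f = Phi @ c_true
c_hat = bpdn_ista(Phi, f, lam=0.5)
```

With this step size each iteration cannot increase the functional, so `objective(Phi, f, c_hat, 0.5)` is lower than its value at the all-zero starting point. Larger `lam` pushes more coefficients to exactly zero at the cost of a larger squared error.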

5 Reconstruction 

In the case of image reconstruction and compression when we do not assume any prior knowledge 
(other than that we are considering images), we can use techniques like JPEG, wavelets, and 
regularization using a spline or gaussian kernel. The focus of this paper is regularization schemes 
for the case where we do have statistical information on the class of functions we are reconstruct- 
ing. When we do have such knowledge, as in the case of the correlational structure of the class to 
which the image to be compressed belongs, we may be able to obtain better compression by using 
this information. As described in the introductory sections, we can use the set of basis functions 
that encode the correlational structure of the class of images we are interested in reconstructing.
For a given image that we would like to approximate, we use these class-specific basis functions
in the SVM formulation to obtain a sparse subset with which we can encode the image.

The generalized correlation kernels are generated from a training set of 924 grey-level 32 × 16
images of pedestrians that have been normalized to the same scale and position. We test
the correlation kernels and the SVM formulation of function approximation by analyzing the
reconstruction of pedestrian images not in the training set and comparing to the widely used
PCA technique. The test database of pedestrian images consists of 50 out-of-sample 32 × 16
grey-level images of frontal and rear views of pedestrians; as in the training set, these images 
have been normalized such that the pedestrian bodies are aligned in the center of the image and 
are scaled to the same size.

Figure 2: Out-of-sample $L_2$ reconstruction error comparison between SVM with correlation kernel
$R_{1.0}$, SVM with gaussian kernel ($\sigma = 3.0$), and PCA, where the input is a random sampling of
the original image. Each of panels (a), (b), and (c) uses a different sized sampling of the image
as input.



For the SVM experiments, we use the correlation kernel corresponding to $d = 1.0$ as our dictionary
of basis functions, so the reconstructed signal will be a sparse linear combination of those basis
functions:

$$R_{1.0}(x, y) = \sum_{n=1}^{M} \lambda_n \phi_n(x) \phi_n(y) \qquad (19)$$



To accurately test the reconstruction performance, we need to measure the ability of the technique
to reconstruct unseen data and not simply fit the data. For each image in the test set, we
randomly partition the pixels into a set that has $M$ pixels (the input set, $F_{input}$) and a set
consisting of the remaining $N - M$ pixels (the test set, $F_{test}$).
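
The random partition into input and test pixels can be sketched as follows. The helper name, the seed, and the choice of one quarter of the pixels as input are assumptions made only for illustration.

```python
import numpy as np

def split_pixels(num_pixels, num_input, seed=0):
    """Randomly split pixel indices into an input set (size num_input) and a test set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_pixels)
    return np.sort(perm[:num_input]), np.sort(perm[num_input:])

N = 32 * 16                  # pixels in one 32 x 16 test image
M = N // 4                   # hypothetical input size, chosen only for illustration
input_idx, test_idx = split_pixels(N, M)
```

The fit then sees only the pixel values at `input_idx`, and the error is measured at `test_idx`, so the reported error reflects reconstruction of unseen pixels rather than interpolation of the fitted ones.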

Figure 3: Reconstruction comparison for a higher resolution image (64 × 32) using identical
random sets of 1/16th of the original pixels as input; (a) the original image, (b) PCA reconstruction
with 74 basis functions, (c) SVM reconstruction with 74 basis functions ($\epsilon = 10$ for the SVM),
(d) locations of the support vectors are denoted as black values. With a small subset of the
original image as input, the SVM reconstruction is clearly superior to the PCA reconstruction.

In the case of the SVM, to find the sparse set of basis functions that minimizes the error over
the input subset, $F_{input}$, we obtain the coefficients of reconstruction by minimizing the SVM
regression functional of Section 4.1 over the input points, which yields a reconstruction of the form:

$$f(x) = \sum_{i=1}^{M} c_i R(x, x_i)$$

The portion of the coefficients, $c_i$, that will be zero is determined by the variable $C$.
For PCA-based reconstruction, we minimize $L_2$ error over $F_{input}$:

$$\min_c \sum_{i=1}^{M} \left\| F_{input}(x_i) - \sum_{j=1}^{N} c_j \phi'_j(x_i) \right\|_{L_2}^2 \qquad (22)$$

where $c_j$ is given by the dot product between $F_{input}$ and $\phi'_j$, with $\phi'_j$ taken over the
$M$ input points:

$$c_j = \langle F_{input}, \phi'_j \rangle \qquad (23)$$
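
A PCA reconstruction from a subset of pixels can be sketched as below. Note the swapped-in technique: this sketch fits the coefficients by ordinary least squares over the input pixels (via `np.linalg.lstsq`), which minimizes the squared error directly, rather than by the dot product of Equation 23. The mean handling, the function name, and the toy basis are also assumptions.

```python
import numpy as np

def pca_reconstruct(phi, mean, image, input_idx, num_components):
    """Fit the first num_components eigenvectors on the input pixels only,
    then evaluate the fitted combination at every pixel."""
    basis = phi[:, :num_components]
    target = image[input_idx] - mean[input_idx]          # mean-normalized input pixels
    coeffs, *_ = np.linalg.lstsq(basis[input_idx], target, rcond=None)
    return mean + basis @ coeffs

rng = np.random.default_rng(1)
phi, _ = np.linalg.qr(rng.normal(size=(40, 40)))         # toy orthonormal basis (columns)
mean = rng.normal(size=40)
true_coeffs = np.array([3.0, -2.0, 1.0])
image = mean + phi[:, :3] @ true_coeffs                  # toy image in the span of 3 components
input_idx = np.arange(0, 40, 2)                          # observe every other pixel
recon = pca_reconstruct(phi, mean, image, input_idx, num_components=3)
```

Because the toy image lies exactly in the span of the three components, `recon` matches `image` at the unobserved pixels as well; for real images the out-of-sample error depends on how many components are used.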

Out-of-sample performance in each case is determined by reconstructing the full image and
measuring the error over the pixels in $F_{test}$. We measure performance as the error achieved
with respect to the number of basis functions used in the above formulations (equivalently,
reconstruction error versus the sparsity of the representation). In the case of the SVM regression,
the number of basis functions is varied by changing the $\epsilon$ parameter. To compare with
PCA-based reconstruction, for a given $\epsilon$, we use, as the number of principal components (i.e.,
basis functions) for the reconstruction, the number of support vectors found in the SVM formulation.
In our experiments, the size of the input set is varied over three different fractions of $N$; error
is measured in $L_2$.

As a benchmark meant to ensure that the performance of the system using SVM with the
correlation kernels is not due exclusively to the SVM machinery, we also show the results using
SVM with gaussian kernels, yielding approximations of the form:

$$f(x) = \sum_{i=1}^{M} c_i \, e^{-\left( \frac{\|x - x_i\|}{\sigma} \right)^2} \qquad (24)$$

where the value of $\sigma$ is determined empirically over a small set of images and that same
$\sigma$ is used throughout. This setting of $\sigma$ for all the tests may be limiting the performance
of the SVM with a gaussian kernel; on the other hand, we are also a priori fixing the locality
parameter, $d$, in our choice of correlation kernel.

The results of these reconstructions, averaged over the 50 out-of-sample images, are shown in
Figures 2a-c for each of the three input-set sizes, respectively. The SVM
reconstructions using different numbers of basis functions were generated by varying $\epsilon$. From
these performance results, we can see that, even though the PCA formulation minimizes $L_2$ error
and SVM regression minimizes error in the $\epsilon$-insensitive norm,
SVM performs better than PCA even when measuring error in $L_2$ over out-of-sample test data.
Furthermore, SVM with the correlation kernels performs better than SVM with gaussian kernels,
showing that the correlation kernels encode important prior information on the pedestrian class.
The difference in performance is most pronounced for the reconstructions that use the smallest
input set.

Figure 3 presents an extreme case where the input data is a random set of only 1/16th (6.25%)
of the image pixels; here, a higher resolution image (64 × 32) is used. The SVM reconstruction
with correlation kernels recovers more of the structure of the pedestrian than PCA, due to
the smoothness preserving properties of the SVM approach to function approximation (Vapnik,
1995).

5.1 Superresolution 

To further highlight the generalization power of the SVM reconstruction, we can do an experiment
to determine superresolution capability, that is, reconstructions at a finer level of detail than
was originally present in the image. Superresolution entails approximating a small image with
some representation and then sampling that representation at a finer scale to recover the higher
resolution image. This could be useful if, for instance, we have an image of a person's face that
is too small for us to be able to recognize who it is; after superresolving the image, the details
that emerge could allow us to recognize the person.

Directly resampling the representation at a finer scale is not possible with our generalized
correlation kernels, since they are discrete kernels generated from high resolution images (64 × 32)
and we cannot subsample them. Therefore, to superresolve a given 32 × 16 image, we can consider
it as a 64 × 32 image sampled every two pixels in both dimensions and then use the correlation
kernel basis functions defined in the high resolution space (64 × 32) to recover the full high
resolution image.
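
The sampling pattern described above can be made concrete. In the sketch below, the flat indices of the high resolution grid that correspond to the low resolution pixels are computed; the function name and the row-major flattening order are assumptions.

```python
import numpy as np

def lowres_sample_indices(high_shape, stride=2):
    """Flat (row-major) indices of the high-resolution pixels hit when sampling
    every `stride` pixels in both dimensions."""
    rows = np.arange(0, high_shape[0], stride)
    cols = np.arange(0, high_shape[1], stride)
    grid = np.zeros(high_shape, dtype=bool)
    grid[np.ix_(rows, cols)] = True
    return np.flatnonzero(grid)

high_shape = (64, 32)
idx = lowres_sample_indices(high_shape)   # 512 positions, one per pixel of the 32 x 16 input
```

These indices play the same role as the input set in the reconstruction experiments: the low resolution pixel values are the known data at those positions, and the kernel expansion is then evaluated at all 64 × 32 positions.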





Figure 4: Superresolution reconstruction from a low resolution (32 × 16) sampling; (a) the
input 32 × 16 image, scaled up to 64 × 32 by direct scaling, (b) the actual 64 × 32 image,
(c) SVM superresolution reconstruction using 272 basis functions from $R_{1.0}$ ($\epsilon = 10$), (d) PCA
superresolution reconstruction using 272 basis functions, and (e) cubic spline interpolation.

As input to the superresolution technique, we take a low resolution 32 × 16 image of a pedestrian
and reconstruct it at high resolution (64 × 32). Figure 4 shows (a) an example of a 32 × 16
image of a pedestrian that has been directly scaled to 64 × 32 and (b) the true 64 × 32 pedestrian
image. These are compared with (c) the superresolved image, reconstructed at 64 × 32 using the
SVM with correlation kernels $R_{1.0}$, (d) a PCA reconstruction, and (e)
a standard cubic spline interpolation reconstruction (Schumaker, 1981). Given the constraints
presented above as well as the fact that the cubic spline interpolation superresolves the image
quite well, for this specific experiment, we favor this standard spline technique over the correlation
kernels.

6 Compression 

We can also investigate image compression using the set of correlation-based basis functions,
in the same manner as the reconstruction experiments presented in Section 5. For the task of
compression, the goal is to approximate the entire given signal $f$ using as few basis functions as
possible. The experiments are run as before; we compare the SVM regularization approach to
compression with our benchmark, PCA-based compression. For the SVM approach, we use the
correlation kernel with $d = 1.0$ and compare with using SVM with gaussian kernels. Performance
is measured as the error achieved for a given number of basis functions. The number of basis
functions used in the case of SVM regression is varied by changing the $\epsilon$ parameter.
As in the reconstruction experiments, the number of eigenvectors we use to compare against
PCA-based compression is the number of support vectors for a given level of $\epsilon$.

Figure 5 plots the reconstruction error against the number of basis functions for three different
error norms: $L_2$, $L_1$, and $L_\epsilon$. Comparing the SVM and PCA approaches to compression is less
conclusive than the reconstruction experiments; the results here depend on the measure of error.
PCA performs better when measured in $L_2$ and $L_1$ while SVM wins when measured in $L_\epsilon$. The
$L_2$ and $L_\epsilon$ results are not surprising; when error is measured in the norm that a technique is
minimizing, we would expect that technique to perform better than the others. On the other
hand, it is not clear which norm results in a reconstructed image that appears more similar to
the original image; Section 8 contains a discussion of the different norms.

6.1 Comparing SVM and BPDN 

Girosi (1997, 1998) showed that Basis Pursuit De-Noising is equivalent to Support Vector
Machines when the $L_2$ norm in the BPDN formulation is replaced by the norm induced by the
regularization kernel. Here, we empirically test the effect of the different error norms in the two
approaches by comparing SVM and BPDN reconstruction error when compressing our test set of
50 pedestrian images. Both of these techniques are evaluated using the correlation kernel $R_{1.0}$.
Figure 6 graphs the results and indicates that the performance of the two techniques is not
identical. For representations using large numbers of basis functions, the performance is comparable,
but BPDN obtains more accurate sparse approximations, when measured in $L_2$, to the original
image (where the number of basis functions is less than 100). Again, the reason behind this is
that we are measuring error in the norm that BPDN is explicitly minimizing.

Figure 5: Comparison of compression error between SVM with correlation kernel $R_{1.0}$, SVM
with gaussian kernel, and PCA; (a) $L_2$ error, (b) $L_1$ error, and (c) $L_\epsilon$ error. The $L_\epsilon$ results are
presented in tabular format. The $L_2$ and $L_1$ results indicate that performance is comparable
between SVM with the correlation kernel and PCA for large numbers of basis functions, but the
SVM generates better sparse approximations (using fewer than 100 basis functions).

Figure 6: A comparison of SVM and BPDN measuring reconstruction error obtained when
representing pedestrian images as a sparse set of correlation-based basis functions ($R_{1.0}$); $L_2$
reconstruction error is plotted against the number of basis functions found by each technique.
The performance of these techniques is comparable for large numbers of basis functions, but
BPDN obtains better sparse approximations, measured in $L_2$, to the original images (number of
basis functions < 100).

7 Multiscale Representations 

Multiscale representations allow us to represent a signal using successive levels of approximation;
lower levels of resolution capture the coarse structure of the signal and finer levels
of resolution encode the details. These representations are standard in the signal processing
literature (Mallat and Zhang, 1989; Simoncelli and Freeman, 1995; Mallat and Zhang, 1993).
In our image reconstruction experiments, we have focused on approximating a signal using a
single kernel with $d = 1.0$, corresponding to coarse scale features. In certain applications, we
may be able to derive class-specific basis functions for several scales; this is the case for our
generalized correlation kernels where, to vary the locality of the basis functions, we simply
change $d$. We can then use the sparsification paradigm on this larger overcomplete dictionary
to obtain a sparse approximation of a given signal with a set of basis functions at several scales.
The SVM formulation for multiple scales has not been derived yet, but Basis Pursuit De-Noising
can be used with these multiscale dictionaries.

As introduced in Section 4.2, Basis Pursuit De-Noising is an approach to sparsification that
minimizes a functional containing a term measuring the approximation error in $L_2$ using a
linear combination of basis functions and a sparsity term in $L_1$. In our signal reconstruction
experiments, where we have focused on using a set of basis functions $\phi_i$ that are at a single scale,
we would minimize:

$$E[c] = \left\| f(x) - \sum_{i=1}^{N} c_i \phi_i(x) \right\|_{L_2}^2 + \lambda \|c\|_{L_1} \qquad (25)$$

for some signal $f$.

Figure 7: Compression error when using multiscale basis functions with BPDN; (a) $L_2$ error
plotted against the $L_0$ norm of the coefficients (i.e., the number of basis functions), (b) $L_2$ error
plotted against the $L_1$ norm of the coefficients. These graphs imply that, in the context of
sparsity, the $L_1$ norm is not a good approximation of $L_0$.

We can formulate the BPDN functional for our case of generating a multiscale representation
using correlation kernels as follows:

$$E[c] = \left\| f(x) - \sum_{i=1}^{N} \sum_{d=d_1}^{d_D} c_{i,d} R_d(x, x_i) \right\|_{L_2}^2 + \lambda \|c\|_{L_1} \qquad (26)$$

where $d$ ranges over the elements of $D$, the set of scales we are using.
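
In matrix terms, the multiscale dictionary is the set of kernels $R_d$ for every scale in $D$ placed side by side, so that each column is one candidate basis function $R_d(\cdot, x_i)$. A minimal sketch, assuming a toy correlation matrix for 8 "pixels" (the data and the function name are illustrative only):

```python
import numpy as np

def multiscale_dictionary(eigvals, eigvecs, scales):
    """Stack the kernels R_d for every d in `scales` side by side.
    Column i of the block for scale d is the basis function R_d(., x_i)."""
    blocks = [(eigvecs * eigvals ** d) @ eigvecs.T for d in scales]
    return np.hstack(blocks)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))
R = X.T @ X / 50                          # toy correlation matrix (assumption)
lam, phi = np.linalg.eigh(R)
D = [0.0, 0.5, 0.75, 1.0]                 # the four-scale setting
dictionary = multiscale_dictionary(lam, phi, D)   # shape (8, 32): 4 scales x 8 positions
```

The coefficient vector for this dictionary has one entry per (position, scale) pair, which is why the dictionary grows linearly with the number of scales while the signal length stays fixed.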

The experiments compare the performance of the BPDN technique for correlation kernels using
various numbers of scales: one scale ($D = \{1.0\}$), two scales ($D = \{0.5, 1.0\}$), and four scales
($D = \{0.0, 0.5, 0.75, 1.0\}$). As before, we run the experiments on our set of 50 out-of-sample
images of pedestrians. Figure 7a, which plots the average reconstruction error in $L_2$ against the
number of basis functions used in the compression, seems to indicate that to achieve a certain
error rate, fewer scales of basis functions are better. This is counter to our argument for using
multiple scales of basis functions since we would expect that, with more scales to choose from,
the minimization technique would be able to obtain a better approximation when choosing basis
functions from this larger dictionary.

To explain this apparent inconsistency, Figure 7b plots reconstruction error against the $L_1$ norm
of the coefficients, which is the measure of sparsity that BPDN minimizes. Here, the desired
behavior of the one-, two-, and four-scale reconstructions is evident: for a given level of
reconstruction error, starting with a multiscale dictionary affords a more sparse representation.
What does this mean?

The true measure of sparsity is the $L_0$ norm of the coefficients, or the number of basis functions.
Since this would lead to an Integer Programming problem which is computationally prohibitive
for the number of basis functions we are using, the BPDN formulation approximates $L_0$ by $L_1$.
These results offer empirical evidence that these norms are in fact very different and $L_1$ is not a
good approximation of $L_0$.
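
A two-vector toy example shows how the two sparsity measures can rank representations in opposite order. The coefficient values are made up for illustration and are not taken from the experiments.

```python
def l0_norm(coeffs, tol=1e-12):
    """Number of nonzero coefficients (the count of basis functions used)."""
    return sum(1 for c in coeffs if abs(c) > tol)

def l1_norm(coeffs):
    """Sum of absolute coefficient values (the sparsity term BPDN minimizes)."""
    return sum(abs(c) for c in coeffs)

few_large = [3.0, 0.0, 0.0, 0.0]     # one basis function with a large weight
many_small = [0.5, 0.5, 0.5, 0.5]    # four basis functions with small weights

print(l0_norm(few_large), l1_norm(few_large))      # 1 3.0
print(l0_norm(many_small), l1_norm(many_small))    # 4 2.0
```

A minimizer of the $L_1$ term prefers `many_small`, even though `few_large` uses a quarter as many basis functions, so a representation can look sparse in one measure and not in the other.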






8 Error Norms for Image Compression 



Figure 8: The two different error norms; (a) $L_2$ norm, (b) $L_\epsilon$ norm.

The techniques for basis selection that we present in this paper use fundamentally different 
criteria to represent signals, depending on what functional form the error term takes; PCA 
minimizes the traditional $L_2$ norm and SVM minimizes $L_\epsilon$, an $\epsilon$-insensitive norm (Pontil et 
al., 1998), both plotted in Figure 8. While the vast majority of reports of image processing 
techniques subscribe to the use of the $L_2$ norm, it is not clear that this measure of error is the 
"best" for this particular domain. One important caveat: any pixel-based norm, in particular 
all $L_p$, is clearly not the "right" error metric to use since the human visual system takes into 
account higher order image structure; our discussion focuses on choosing the best norm when we 
are restricted to a "pixelwise" cost such as $L_p$ or $L_\epsilon$. 

In the context of image reconstruction, the $L_2$ norm penalizes any perturbations from the true 
value, while the $L_\epsilon$ norm does not penalize values that are within $\epsilon$ of the true value, but linearly 
penalizes values lying outside of this region. The difference in these similarity measures is shown 
in Figure 9; Figure 9a has low $L_2$ error and high $L_\epsilon$ error, relative to 9c, while Figure 9c has high 
$L_2$ error and low $L_\epsilon$ error, relative to 9a; 9b is the true image. The deviations in Figure 9a seem 
to stand out more than those in 9c, but 9c has higher $L_2$ error. 
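
The following sketch shows how the two pixelwise costs can rank a pair of reconstructions in opposite order. It is an illustration under stated assumptions: the 100-pixel "images", the value of `eps`, and the function names are hypothetical and are not the images of Figure 9.

```python
def l2_error(true, approx):
    """Squared L2 error: every deviation is penalized quadratically."""
    return sum((t - a) ** 2 for t, a in zip(true, approx))


def eps_insensitive_error(true, approx, eps):
    """Deviations within eps cost nothing; larger ones are penalized linearly."""
    return sum(max(0.0, abs(t - a) - eps) for t, a in zip(true, approx))


eps = 0.1
true_image = [0.5] * 100

# One pixel is badly wrong, all others are exact.
image_a = list(true_image)
image_a[0] += 0.5

# Every pixel is slightly wrong, but each deviation stays inside the eps tube.
image_c = [t + 0.09 for t in true_image]

print("a: L2 =", l2_error(true_image, image_a),
      "L_eps =", eps_insensitive_error(true_image, image_a, eps))
print("c: L2 =", l2_error(true_image, image_c),
      "L_eps =", eps_insensitive_error(true_image, image_c, eps))
```

In this toy case `image_a` has the lower squared $L_2$ error but the higher $L_\epsilon$ error, which mirrors the relationship between panels (a) and (c) described above.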

How are we to reconcile this seeming inconsistency in what the traditional $L_2$ error tells us with 
what our brain tells us? It is well known that people cannot perceive differences in intensity 
that are very small (Schade, 1956; Campbell and Robson, 1968; Hess and Howell, 1977). In 
DeVore et al. (1992), the authors argue that the $L_1$ error norm is a more accurate mathematical 
realization of the norm embedded in the human visual system than the $L_2$ norm. Fundamental to 
their hypothesis is the structure of the Contrast Sensitivity Threshold (CST) curve that captures 
a person's ability to distinguish an oscillating pattern of increasing frequency at different levels 
of contrast. Their argument determines the value of $p$ for which the $L_p$ norm best fits what 
the geometry of the CST curve implies; they find that $p = 1$ is the best approximation of the 
perceptual system's norm. 

We can combine their results with the fact that at low contrasts in the middle frequencies of the 
CST curve it is nearly impossible to distinguish the different bands, implying the existence of 
some base threshold. This leads us to postulate that the $L_\epsilon$ norm may be a more perceptually 
accurate norm than $L_1$, since it encodes both the geometric constraints and the threshold evident in 
the CST curve. In the absence of a psychophysical experiment that investigates this hypothesis, 
this conjecture is, of course, speculation. 

Figure 9: Examples of images with different types of errors; (a) low $L_2$ error, high $L_\epsilon$ error, 
relative to image (c); (b) true image; (c) high $L_2$ error, low $L_\epsilon$ error, relative to image (a). 

9 Conclusion 

We have shown that the use of class-specific correlation-based kernels, when combined with the 
notion of sparsity, results in a powerful signal reconstruction technique. In a comparison to 
a traditional method of signal approximation, Principal Components Analysis, our approach 
achieves a more sparse representation for a given level of error. 

For signal compression, the difference in performance between the techniques is not easily 
evaluated; when using different measures of error, we obtain a different "best" system. The choice 
of a system to use could depend on the characteristics of the different norms. The $L_2$ norm 
penalizes any difference in reconstruction. On the other hand, the $L_\epsilon$ norm does not penalize 
differences in the small $\epsilon$-insensitive region around the true value, but linearly penalizes errors 
outside this region. One way of comparing the $L_2$, $L_1$, and $L_\epsilon$ norms could be to decide which is 
a more accurate description of psychophysical measures of similarity between images. Based on 
the arguments presented in Section 8 and the references cited therein, we postulate that the $L_\epsilon$ 
norm may be the norm we should use in image reconstruction, superresolution, and compression. 

Our approach of using a dictionary of class-specific correlation kernels to obtain a sparse 
representation of a signal leads to an interesting question: could this sparse representation that has been 
generated to approximate a signal be used to classify different signals? In other words, is the 
representation of pedestrians via sparse sets of correlation-based basis functions different enough 
from the representation of other objects (or all other objects), so that it can be used as a model 
for that class of objects? The representations we generate are derived through an argument 
that minimizes error for reconstructing the image. This, however, says nothing about the ability 
of that same representation to be used to differentiate images of different objects. Whether or 
not this can be done is an open question; Appendix C presents a preliminary discussion of this 
approach. 
A The BPDN QP Formulation 

The Basis Pursuit De-Noising formulation minimizes the following functional: 

$$\left\| f(x) - \sum_{i=1}^{N} c_i \phi_i(x) \right\|_{L_2}^2 + \lambda \|c\|_{L_1} \qquad (27)$$

To make the expansion of Equation 27 easier, we decompose c into its positive and negative 
coefficients: 

$$c = c^+ - c^- \qquad (28)$$

where, to enforce the constraint that a coefficient is non-zero in at most one of the vectors $c^+$ 
or $c^-$, we have: 

$$c_i^+, c_i^- \geq 0, \qquad c_i^+ c_i^- = 0 \quad \forall i = 1, \ldots, N$$

This allows us to rewrite the sparsity term as: 

$$\|c\|_{L_1} = \mathbf{1}^T (c^+ + c^-) = \sum_{i=1}^{N} (c_i^+ + c_i^-).$$

We therefore expand Equation 27 as: 

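$$\|f(x)\|^2 - 2 \sum_{i=1}^{N} c_i \langle f(x), \phi_i(x) \rangle + \sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j \langle \phi_i(x), \phi_j(x) \rangle + \lambda \mathbf{1}^T (c^+ + c^-) \qquad (29)$$

Since $\|f(x)\|^2$ is a constant, it does not affect the minimization, so we have: 

$$-2 \sum_{i=1}^{N} c_i \langle f(x), \phi_i(x) \rangle + \sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j \langle \phi_i(x), \phi_j(x) \rangle + \lambda \mathbf{1}^T (c^+ + c^-) \qquad (30)$$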

Letting: 



$$y_i = \langle f(x), \phi_i(x) \rangle$$
$$M_{ij} = \langle \phi_i(x), \phi_j(x) \rangle$$

we get: 

$$-2 \sum_{i=1}^{N} c_i y_i + \sum_{i=1}^{N} \sum_{j=1}^{N} (c_i^+ - c_i^-)(c_j^+ - c_j^-) \langle \phi_i(x), \phi_j(x) \rangle + \lambda \mathbf{1}^T (c^+ + c^-) \qquad (31)$$

Using the following definitions, 

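$$d = (c^+, c^-)^T$$
$$Y = (y, -y)^T$$

the first and last terms can be rewritten as: 

$$-2 c^T y + \lambda \mathbf{1}^T (c^+ + c^-) = d^T (\lambda \mathbf{1} - 2Y)$$

so we have: 

$$\sum_{i=1}^{N} \sum_{j=1}^{N} (c_i^+ - c_i^-)(c_j^+ - c_j^-) \langle \phi_i(x), \phi_j(x) \rangle + d^T (\lambda \mathbf{1} - 2Y) \qquad (32)$$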



Taking: 

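$$H = 2 \begin{pmatrix} M & -M \\ -M & M \end{pmatrix}$$

the final form of this QP problem is 

$$\text{minimize} \quad \frac{1}{2} d^T H d + d^T (\lambda \mathbf{1} - 2Y) \qquad (33)$$

subject to the constraints: 

$$d \geq 0 \qquad (34)$$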
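
As a check on the algebra, the sketch below builds $H$ and the linear term from a small Gram matrix and confirms that the QP objective of Equation 33 agrees with the BPDN functional of Equation 30 when $c = c^+ - c^-$. The function names and the numerical values of `M`, `y`, `lam`, and `c` are hypothetical and chosen only for illustration.

```python
def build_bpdn_qp(M, y, lam):
    """Return (H, q) with H = 2*[[M, -M], [-M, M]] and q = lam*1 - 2*Y, Y = (y, -y)."""
    n = len(y)
    H = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            H[i][j] = 2.0 * M[i][j]
            H[i][j + n] = -2.0 * M[i][j]
            H[i + n][j] = -2.0 * M[i][j]
            H[i + n][j + n] = 2.0 * M[i][j]
    q = [lam - 2.0 * yi for yi in y] + [lam + 2.0 * yi for yi in y]
    return H, q


def qp_objective(H, q, d):
    """Equation 33: 0.5 * d^T H d + d^T q."""
    m = len(d)
    quad = sum(d[i] * H[i][j] * d[j] for i in range(m) for j in range(m))
    return 0.5 * quad + sum(qi * di for qi, di in zip(q, d))


def bpdn_objective(M, y, lam, c):
    """Equation 30: -2 c^T y + c^T M c + lam * ||c||_1."""
    n = len(c)
    quad = sum(c[i] * M[i][j] * c[j] for i in range(n) for j in range(n))
    fit = -2.0 * sum(ci * yi for ci, yi in zip(c, y))
    return fit + quad + lam * sum(abs(ci) for ci in c)


M = [[2.0, 0.5], [0.5, 1.0]]     # hypothetical Gram matrix of two basis functions
y = [1.0, -0.3]                  # hypothetical inner products <f, phi_i>
lam = 0.1
c = [0.7, -0.2]
d = [max(ci, 0.0) for ci in c] + [max(-ci, 0.0) for ci in c]   # d = (c+, c-)

H, q = build_bpdn_qp(M, y, lam)
print(qp_objective(H, q, d), bpdn_objective(M, y, lam, c))
```

The two printed values agree up to floating-point rounding, since the split into $c^+$ and $c^-$ only re-expresses the same objective in non-negative variables.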

We compute the M matrix by taking the inner products of different basis functions; the basis 
functions we use are the correlation kernels from Section 2. For notational simplicity, let $R(\cdot)$ refer 
to the correlation kernel with $d = 1.0$, $Q(\cdot)$ to the kernel with $d = 0.5$, and $P(\cdot)$ to the kernel 
with $d = 0.0$. 



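$$\int R(x, x_i) R(x, x_j)\, dx = \int \Big( \sum_k \lambda_k \phi_k(x) \phi_k(x_i) \Big) \Big( \sum_l \lambda_l \phi_l(x) \phi_l(x_j) \Big) dx$$
$$= \sum_k \sum_l \lambda_k \lambda_l \phi_k(x_i) \phi_l(x_j) \langle \phi_k(x), \phi_l(x) \rangle$$
$$= \sum_k \lambda_k^2 \phi_k(x_i) \phi_k(x_j)$$

where the last step uses the orthonormality of the $\phi_k$. This corresponds to the correlation kernel with $d = 2.0$, i.e. 

$$\int R(x, x_i) R(x, x_j)\, dx = R_{2.0}(x_i, x_j)$$

Similarly, we can show that for corresponding choices of basis functions, $Q$ and $P$, we get: 

$$\int Q(x, x_i) Q(x, x_j)\, dx = R_{1.0}(x_i, x_j)$$
$$\int P(x, x_i) P(x, x_j)\, dx = R_{0.0}(x_i, x_j)$$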



Therefore, the matrix M does not need to be computed on the fly; we can simply store the 
correlation function of the signal and use this at run-time. 
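
A discrete analogue of these identities can be checked numerically. The sketch below assumes the generalized kernel has the form $R_d(x, y) = \sum_k \lambda_k^d \phi_k(x) \phi_k(y)$ with orthonormal $\phi_k$, which is what the derivation above relies on; the two-pixel "signal space", the eigenvalues, and the helper names are hypothetical.

```python
import math


def correlation_kernel(eigvals, eigvecs, d):
    """Discrete R_d[i][j] = sum_k eigvals[k]**d * eigvecs[k][i] * eigvecs[k][j]."""
    n = len(eigvecs[0])
    return [[sum(lam ** d * phi[i] * phi[j] for lam, phi in zip(eigvals, eigvecs))
             for j in range(n)] for i in range(n)]


def gram(K):
    """M[i][j] = sum_x K[x][i] * K[x][j], the discrete version of the integral over x."""
    n = len(K)
    return [[sum(K[x][i] * K[x][j] for x in range(n)) for j in range(n)]
            for i in range(n)]


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Hypothetical orthonormal basis (a 2-D rotation) and eigenvalues.
theta = 0.3
eigvecs = [[math.cos(theta), math.sin(theta)],
           [-math.sin(theta), math.cos(theta)]]
eigvals = [4.0, 0.25]

R = correlation_kernel(eigvals, eigvecs, 1.0)
Q = correlation_kernel(eigvals, eigvecs, 0.5)
P = correlation_kernel(eigvals, eigvecs, 0.0)

print(max_abs_diff(gram(R), correlation_kernel(eigvals, eigvecs, 2.0)))
print(max_abs_diff(gram(Q), R))
print(max_abs_diff(gram(P), P))
```

All three differences are at the level of rounding error, so in this setting the entries of M can be read from a stored kernel with twice the exponent instead of being integrated at run-time.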

B QP Decomposition Algorithm 

For the Basis Pursuit De-Noising approach to the sparsity problem, the size of the quadratic 
programming problem is directly related to the number of basis functions contained in our 
dictionary of features. The computational limitations come from the size of the matrix H in 
Equation 33; if there are $n$ features in our dictionary, the size of the matrix will be $4n^2$. Even 
for dictionaries where $n$ is on the order of $O(10^3)$, the amount of space this matrix takes up 
is immense. We would like to have both a system that uses a rich set of basis functions and 
one that is computationally tractable; for this we develop an active set method that decomposes 
the problem into smaller elements, under the expectation that most basis functions will not be 
included in the final solution. 

The algorithm proceeds by first finding a feasible solution to a smaller problem. We then check 
the optimality conditions of the original problem at this point; if the solution is not optimal, 
the smaller problem is modified by substituting in elements that will help reduce the objective 
function. This process of finding a feasible solution in a smaller problem, checking the 
optimality of this point, and modifying the problem to push it towards an optimal point is 
iterated until an optimal solution is found. 

The details regarding the optimality conditions and the actual decomposition algorithm are 
presented in the rest of this section. 

B.1 Optimality Conditions 

In general terms, the minimization problem is formulated as follows: 

$$\text{minimize} \quad f(d) \qquad (35)$$

subject to the constraints: 

$$g_1(d) \leq 0, \quad g_2(d) \leq 0, \quad \ldots, \quad g_m(d) \leq 0 \qquad (36)$$

Finding an optimal solution to this problem entails a constrained search in parameter space 
($d$) to minimize the objective function in Equation 35 while maintaining the constraints in 
Equation 36. A point in space, $d'$, that satisfies the constraints is called a feasible point. If 
H is positive definite, the objective function we are minimizing is strictly convex so a feasible 
point $d'$ is an optimal solution if it satisfies a set of conditions called the Karush-Kuhn-Tucker 
(KKT) conditions (Karush, 1939; Kuhn and Tucker, 1951; Bazaraa et al., 1979). For the 
general problem, the KKT conditions are, in addition to the primal feasibility (PF) condition, 
the following: 

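$$\begin{aligned} \nabla f(d) + \sum_{i=1}^{m} \nu_i \nabla g_i(d) &= 0 && \text{(DF)} \\ \nu_i &\geq 0 \quad \forall i = 1, \ldots, m && \text{(DF)} \\ \nu_i g_i(d) &= 0 \quad \forall i = 1, \ldots, m && \text{(CS)} \end{aligned} \qquad (37)$$

where $\nu_i$ are the Lagrange multipliers of the problem, DF indicates a dual feasibility condition, 
and CS indicates the complementary slackness condition. 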
The QP problem we address is: 

$$\text{minimize} \quad \frac{1}{2} d^T H d + d^T C \qquad (38)$$

subject to the constraints: 

$$d \geq 0, \qquad d \leq u \mathbf{1} \qquad (39)$$



which can be placed into the general form as: 

$$\begin{aligned} -d &\leq 0 && (g_1) \\ d - u \mathbf{1} &\leq 0 && (g_2) \end{aligned} \qquad (40)$$



The formulas for the KKT conditions for this problem are as follows: 

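$$\begin{aligned} \nabla f(d) + \mu^T \nabla g_1(d) + \nu^T \nabla g_2(d) &= 0 \\ \mu^T g_1(d) &= 0 \\ \nu^T g_2(d) &= 0 \\ \mu &\geq 0 \\ \nu &\geq 0 \end{aligned} \qquad (41)$$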

which yield: 



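$$\begin{aligned} [Hd + C]_i - \mu_i + \nu_i &= 0 \\ \mu_i d_i &= 0 \\ \nu_i (d_i - u) &= 0 \\ \mu_i &\geq 0 \\ \nu_i &\geq 0 \end{aligned} \qquad \forall i = 1, \ldots, n \qquad (42)$$
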



Since in our case H is positive definite, the objective function we are minimizing is convex and, 
if the KKT conditions hold for a feasible point, this point is an optimal solution. 
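
The conditions of Equation 42 can be tested at a candidate point without computing the multipliers explicitly. In the sketch below (function names, tolerance, and the two-variable example are hypothetical), complementary slackness is used to eliminate $\mu_i$ and $\nu_i$: at the lower bound only $\mu_i$ may be nonzero, at the upper bound only $\nu_i$, and strictly inside the bounds both vanish.

```python
def gradient(H, C, d):
    """Gradient of 0.5 * d^T H d + d^T C, i.e. H d + C."""
    n = len(d)
    return [sum(H[i][j] * d[j] for j in range(n)) + C[i] for i in range(n)]


def satisfies_kkt(H, C, d, u, tol=1e-9):
    """Check primal feasibility and the per-coordinate conditions of Equation 42."""
    for g_i, d_i in zip(gradient(H, C, d), d):
        if d_i < -tol or d_i > u + tol:
            return False                      # primal feasibility violated
        if d_i <= tol:
            # d_i = 0 forces nu_i = 0, so mu_i = g_i must be non-negative.
            if g_i < -tol:
                return False
        elif d_i >= u - tol:
            # d_i = u forces mu_i = 0, so nu_i = -g_i must be non-negative.
            if g_i > tol:
                return False
        elif abs(g_i) > tol:
            # Strictly inside the bounds both multipliers are zero.
            return False
    return True


H = [[2.0, 0.0], [0.0, 2.0]]
C = [-2.0, 1.0]

print(satisfies_kkt(H, C, [1.0, 0.0], u=10.0))   # optimal for the wide box
print(satisfies_kkt(H, C, [0.5, 0.0], u=10.0))   # interior point with nonzero gradient
print(satisfies_kkt(H, C, [0.5, 0.0], u=0.5))    # same point, now on the upper bound
```

The same point `[0.5, 0.0]` fails the test when the upper bound is loose and passes when the bound is tightened to 0.5, which shows how the sign condition on the gradient depends on which constraint is active.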
B.2 Decomposition Algorithm 

For our particular problem, we are interested in obtaining a solution where the number of 
non-zero elements of $d$ is small in comparison to the number of zero coefficients; this is exactly the 
sparsity criterion. The decomposition algorithm we develop will push the objective function down 
the gradient until a point is reached where the objective function can no longer be decreased. 
To start developing the algorithm, we define an index set $I$ on the variables $d$ and then partition 
$I$ into $[B, N]$, such that the optimality conditions are enforced only in the smaller QP problem 
defined over the variables in $B$. The vector $d$ is partitioned into $d_B$ and $d_N$, where $d_i = 0$ $\forall i \in N$; 
our goal is to have $B$ index the sparse nonzero coefficients. 

Since we are looking for a sparse representation, $d_B$ will have relatively few elements; minimizing 
this smaller objective function will be efficient. Since we set $d_i = 0$ $\forall i \in N$, the value of the 
objective function we get by solving the smaller QP problem is equal to the value of the original 
objective function. For a formal proof showing that improving the cost function defined over the 
sub-problem strictly improves the global cost function we are minimizing, see Osuna et al. (1997). 
After solving the smaller QP problem, we check the KKT conditions to see if this solution is 
optimal. The KKT conditions postulate that for a solution to be optimal, the following must 
hold for each $d_i$: 



$$[Hd + C]_i \; \begin{cases} \geq 0 & \text{if } d_i = 0 \\ = 0 & \text{if } 0 < d_i < u \\ \leq 0 & \text{if } d_i = u \end{cases} \qquad (43)$$



This means that, for each coefficient $d_j$, $j \in B$, that lies strictly between its bounds, 
$[Hd + C]_j = 0$ must be true. If, for any $d_i$, $i \in N$, 
$[Hd + C]_i < 0$, then the addition of $d_i$ to the working set would decrease the objective function; 
that is, the current solution is not optimal. Hence, we exchange each $d_i$, $i \in N$, where $[Hd + C]_i < 0$ 
with a $d_j$, $j \in B$, where $d_j = 0$ (and $d_j$ is therefore not contributing to minimizing the objective 
function); it is easy to see that this pivoting does not change the value of the objective function. 
The algorithm will move down the gradient until it reaches an optimal solution; the stopping 
criterion is that there are no more $d_i$, $i \in N$, with $[Hd + C]_i < 0$. From the KKT conditions, this 
means that $[Hd + C]_i \geq 0$ for every $d_i$ fixed at zero, while the conditions on $B$ already hold because 
the smaller problem was solved to optimality; the solution is therefore optimal. 
The decomposition algorithm is as follows: 

1. Partition the variables into $d_B$ and $d_N$ such that the $d_i$ are fixed to 0 $\forall i \in N$. 

2. Solve the smaller QP problem over $d_B$; since $d_i = 0$ $\forall i \in N$, these variables do not affect the value of the 
objective function. 

3. While there is a $d_i$, $i \in N$, such that $[Hd + C]_i < 0$ (i.e., the contribution of this variable 
will push down the objective function), pivot it with a $d_j = 0$, $j \in B$ (i.e., $d_j$ is 
not contributing to reducing the objective function). Go to (2) and repeat. 
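
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the sub-problem solver is a simple projected coordinate descent chosen for brevity (the text does not say which solver is used), the rule of growing the working set when no idle variable is available is an added assumption, and all names and the six-variable example are hypothetical.

```python
def gradient(H, C, d):
    """Gradient H d + C of the QP objective 0.5 * d^T H d + d^T C."""
    n = len(d)
    return [sum(H[i][j] * d[j] for j in range(n)) + C[i] for i in range(n)]


def solve_subproblem(H, C, u, B, d, sweeps=200):
    """Stand-in sub-problem solver: projected coordinate descent over indices in B.

    Variables outside B keep their current (zero) values. Assumes H[i][i] > 0.
    """
    n = len(d)
    for _ in range(sweeps):
        for i in B:
            g_i = sum(H[i][j] * d[j] for j in range(n)) + C[i]
            d[i] = min(u, max(0.0, d[i] - g_i / H[i][i]))
    return d


def decompose(H, C, u, working_size, tol=1e-8, max_iter=100):
    n = len(C)
    d = [0.0] * n
    B = list(range(working_size))          # step 1: initial partition
    N = list(range(working_size, n))
    for _ in range(max_iter):
        solve_subproblem(H, C, u, B, d)    # step 2: solve over d_B only
        g = gradient(H, C, d)
        # Step 3: variables in N whose inclusion would lower the objective.
        violators = sorted((i for i in N if g[i] < -tol), key=lambda i: g[i])
        if not violators:
            return d, B                    # stopping criterion reached
        idle = [j for j in B if d[j] <= tol]
        if idle:
            for i, j in zip(violators, idle):   # pivot violator i with idle j
                B[B.index(j)] = i
                N[N.index(i)] = j
        else:
            # Assumption: with no idle variable to pivot out, grow the working set.
            N.remove(violators[0])
            B.append(violators[0])
    raise RuntimeError("decomposition did not converge")


# Hypothetical 6-variable problem whose solution has two nonzero entries.
n = 6
H = [[2.0 if i == j else 0.2 for j in range(n)] for i in range(n)]
C = [1.0, 1.0, -4.0, 1.0, -2.0, 1.0]
u = 10.0

d_dec, B_final = decompose(H, C, u, working_size=2)
d_full = solve_subproblem(H, C, u, list(range(n)), [0.0] * n)
print(sorted(B_final), [round(v, 4) for v in d_dec])
```

In this example the working set never holds more than two variables, yet the result matches the solution obtained by running the same solver over all six variables at once.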

C Classification 

The pattern classification problem is one where, instead of approximating a signal, we would like 
to decide to which class of patterns that signal belongs. For simplicity, let us say that we are 
interested in a classification problem where there are two classes, $C_1$ and $C_2$. We may be able to 
relate the distinct problems of regression and classification through our use of class-specific basis 
functions. Specifically, we would like to argue that the features of an object class $C_1$ that are 
important for reconstructing elements of that class may be useful for differentiating elements of 
that class from elements of the other class $C_2$. More formally, we would like to classify the image 
$f(x)$ but we also know the generalized correlation function $R(x, y)$ of the set of similar images 
$f_\alpha(x)$, from which the correlation function was derived. We can follow the general approach 
of Penev and Atick (1996), who use the sparsified kernels computed for regression for 
classification; in our case, we will use an SVM classifier. 

C.1 Using the Regression Kernel $R_d$ for Pattern Classification 

Consider the problem of classification applied to images of dimensionality $N$; here, each 
real-valued pixel corresponds to one dimension. The goal is to learn a mapping $g$ from points in $\mathbb{R}^N$ 
to a binary variable, $C$, that indicates the possible classes. In general, this is a difficult task 
because the dimensionality $N$ is usually large. To make this tractable, we can use the notion 
of sparse representations to compress the "index" space $\mathbb{R}^N$ into a smaller space that accurately 
approximates the original space. As we have shown in this paper, this can be done using SVM 
regression or BPDN. 

Let us assume that we have found the optimal sparse set of $R_d(x, x_i)$ for $i = 1, \ldots, N'$ ($N' \ll N$) 
over the set of images $f_\alpha$. Thus: 

$$f_\alpha(x) = \sum_{i=1}^{N'} a_i^\alpha R_d(x, x_i) \qquad (44)$$

where the $x_i$ are not computed from the specific images (for instance, they could be generated 
by sparsifying the average image $E[f_\alpha]$; see the remark later). Then we can estimate the coefficients 
$a^\alpha$ for each image from 

$$f_\alpha = R a^\alpha \qquad (45)$$

that is, $a^\alpha = R^\dagger f_\alpha$ (see footnote 2). The matrix $R^\dagger$ can be precomputed at the locations $x_i$ given by the sparsification 
of the average image. 
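
A small least-squares sketch of Equation 45 follows. It assumes $R$ is an $N \times N'$ matrix of full column rank, in which case the pseudoinverse solution can be obtained from the normal equations $(R^T R) a = R^T f$; the helper names and the three-pixel, two-kernel example are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def estimate_coefficients(R, f):
    """Least-squares coefficients a = pinv(R) f, via the normal equations."""
    rows, cols = len(R), len(R[0])
    RtR = [[sum(R[p][i] * R[p][j] for p in range(rows)) for j in range(cols)]
           for i in range(cols)]
    Rtf = [sum(R[p][i] * f[p] for p in range(rows)) for i in range(cols)]
    return solve_linear(RtR, Rtf)


# Hypothetical kernel matrix: 3 pixels (rows) by 2 retained kernel centers (columns).
R = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
a_true = [2.0, -1.0]
f = [sum(R[p][i] * a_true[i] for i in range(2)) for p in range(3)]

a_hat = estimate_coefficients(R, f)
print(a_hat)
```

Because `f` is generated exactly from `a_true`, the recovered coefficients match it up to rounding; for a real image the same computation returns the best fit in the least-squares sense.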

In many image classification problems there are two classes: the class of images we are interested 
in, and the class of all other images. The latter class will be associated with a correlation function 
which is translation invariant and rather "generic". It would be advantageous to use both kernels 
within the classifier but it is not clear what is the best way to do it. 



2 The coefficients computed in this way are not the correct ones from the point of view of SVM regression. 